Scripted vs Spontaneous Speech#
In this section, we explore the differences between scripted speech and casual/spontaneous speech. The two speaking styles differ only subtly in vocal characteristics, yet these differences can be impactful. Speaking style has been observed to affect voice perception in humans in the case of unfamiliar voices (Smith et al. (2019), Stevenage et al. (2021) and Afshan et al. (2022)). Accordingly, we investigate the effect of speaking style on generated speech embeddings, which should maintain close distances between samples from the same speaker.
Research Questions:#
Is there a noticeable within-speaker difference between scripted and spontaneous speech utterances?
Would the difference change depending on the type of feature extractor used?
Is this difference maintained in lower dimensions?
Dataset Description:#
The dataset used in this experiment is obtained from here. We compiled speech utterances from 26 speakers (14 females and 12 males). The collected dataset comprises 7 tasks (4 scripted/3 spontaneous).
Tasks:
NWS (script): Reading ‘The North Wind and Sun’ passage
LPP (script): Reading ‘The Little Prince’ sentences
DHR (script): Reading ‘Declaration of Human Rights’ sentences
HT2 (script): Reading ‘Hearing in Noise Test 2’ sentences
QNA (spon): Answering questions ‘Q and A session’
ST1 (spon): Telling a personal story 1
ST2 (spon): Telling a personal story 2
The dataset was preprocessed by downsampling to 16 kHz to be compatible with BYOL-S models. Additionally, the utterances were cropped to fixed durations (1, 3, 5, 10, 15 sec), yielding 5 new datasets derived from the original one.
Finally, the naming convention for the audio files is: {ID}_{Gender}_{Task}_{Label}_{File Number}.wav (e.g. 049_F_DHR_script_000.wav).
In the following analysis, we will be using the 3sec-utterance version of the dataset.
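The filename convention above can be parsed directly to recover the per-sample metadata. A minimal sketch (illustrative only; the notebook itself relies on deciphering_enigma.extract_metadata for this step):

```python
# Parse the {ID}_{Gender}_{Task}_{Label}_{File Number}.wav convention.
# Illustrative helper; the notebook uses deciphering_enigma.extract_metadata.
def parse_filename(filename):
    stem = filename.rsplit('.', 1)[0]  # drop the .wav extension
    speaker_id, gender, task, label, file_number = stem.split('_')
    return {'ID': speaker_id, 'Gender': gender, 'Task': task,
            'Label': label, 'File_Number': file_number}

parse_filename('049_F_DHR_script_000.wav')
# {'ID': '049', 'Gender': 'F', 'Task': 'DHR', 'Label': 'script', 'File_Number': '000'}
```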
1) Loading Data#
import deciphering_enigma
#define the experiment config file path
path_to_config = './config.yaml'
#read the experiment config file
exp_config = deciphering_enigma.load_yaml_config(path_to_config)
dataset_path = exp_config.dataset_path
#register experiment directory and read wav files' paths
audio_files = deciphering_enigma.build_experiment(exp_config)
print(f'Dataset has {len(audio_files)} samples')
Dataset has 6471 samples
if exp_config.preprocess_data:
    dataset_path = deciphering_enigma.preprocess_audio_files(audio_files, speaker_ids=metadata_df['ID'], chunk_dur=exp_config.chunk_dur, resampling_rate=exp_config.resampling_rate,
                                                             save_path=f'{exp_config.dataset_name}_{exp_config.model_name}/preprocessed_audios', audio_format=audio_format)
#balance data to have equal number of labels per speaker
audio_files = deciphering_enigma.balance_data()
print(f'After Balancing labels: Dataset has {len(audio_files)} samples')
#extract metadata from file name convention
metadata_df, audio_format = deciphering_enigma.extract_metadata(exp_config, audio_files)
#load audio files as torch tensors to get ready for feature extraction
audio_tensor_list = deciphering_enigma.load_dataset(audio_files, cfg=exp_config, speaker_ids=metadata_df['ID'], audio_format=audio_format)
After Balancing labels: Dataset has 5816 samples
Audio Tensors are already saved for scriptvsspon_speech
2) Generating Embeddings#
We generate speech embeddings from 9 different models (BYOL-A, BYOL-S/CNN, BYOL-S/CvT, Hybrid BYOL-S/CNN, Hybrid BYOL-S/CvT, TRILLsson, Wav2Vec2, HuBERT and Data2Vec).
#generate speech embeddings
embeddings_dict = deciphering_enigma.extract_models(audio_tensor_list, exp_config)
Load BYOL-A_default Model
BYOL-A_default embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load BYOL-S_default Model
BYOL-S_default embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load Hybrid_BYOL-S_default Model
Hybrid_BYOL-S_default embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load BYOL-S_cvt Model
BYOL-S_cvt embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load Hybrid_BYOL-S_cvt Model
Hybrid_BYOL-S_cvt embeddings are already saved for scriptvsspon_speech
(5816, 2048)
Load TRILLsson Model
TRILLsson embeddings are already saved for scriptvsspon_speech
(5816, 1024)
Load Wav2Vec2 Model
Wav2Vec2 embeddings are already saved for scriptvsspon_speech
(5816, 1024)
Load HuBERT Model
HuBERT embeddings are already saved for scriptvsspon_speech
(5816, 1280)
Load Data2Vec Model
Data2Vec embeddings are already saved for scriptvsspon_speech
(5816, 1024)
3) Original Dimension Analysis#
3.1. Distance-based#
Compute distances (e.g. cosine distance) across embeddings of utterances. Steps:
Compute distances across all 5816 samples in pairwise form (5816x5816).
Convert the pairwise form to long form, i.e. two long columns [Sample1, Sample2, Distance], yielding a dataframe 5816*5816 rows long.
Remove rows with zero distances (i.e. distances between a sample and itself).
Keep only the distances between samples from the same speaker and the same label (e.g. Dist{speaker1_Label1_audio0 –> speaker1_Label1_audio1}), as shown in the figure below.
Remove duplicates, i.e. the distance 0 –> 1 == the distance 1 –> 0.
Standardize distances within each speaker to account for within-speaker variability.
Remove distances above the 99th percentile (outliers).
Plot a violin plot for each model, split by label, to see how these models encode both labels.
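The per-speaker standardization and outlier-trimming steps above can be sketched as follows, assuming a long-form dataframe with ID and Distance columns; the actual implementation lives inside deciphering_enigma.compute_distances:

```python
import pandas as pd

def standardize_and_trim(df):
    """z-score distances within each speaker, then drop the top 1% as outliers."""
    df = df.copy()
    # standardize within each speaker to control for within-speaker variability
    df['Distance'] = (df.groupby('ID')['Distance']
                        .transform(lambda d: (d - d.mean()) / d.std()))
    # remove distances above the 99th percentile
    return df[df['Distance'] <= df['Distance'].quantile(0.99)]

toy = pd.DataFrame({'ID': ['049']*4 + ['050']*4,
                    'Distance': [0.1, 0.2, 0.3, 5.0, 0.4, 0.5, 0.6, 0.7]})
trimmed = standardize_and_trim(toy)  # the extreme 5.0 row is dropped
```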

df_all = deciphering_enigma.compute_distances(metadata_df, embeddings_dict, exp_config.dataset_name, 'cosine', list(metadata_df.columns))
DF for the cosine distances using BYOL-A_default already exist!
DF for the cosine distances using BYOL-S_default already exist!
DF for the cosine distances using Hybrid_BYOL-S_default already exist!
DF for the cosine distances using BYOL-S_cvt already exist!
DF for the cosine distances using Hybrid_BYOL-S_cvt already exist!
DF for the cosine distances using TRILLsson already exist!
DF for the cosine distances using Wav2Vec2 already exist!
DF for the cosine distances using HuBERT already exist!
DF for the cosine distances using Data2Vec already exist!
deciphering_enigma.visualize_violin_dist(df_all)
3.2. Similarity Representation Analysis:#
import numpy as np
from tqdm import tqdm
cka_class = deciphering_enigma.CKA(unbiased=True, kernel='rbf', rbf_threshold=0.5)
num_models = len(embeddings_dict.keys())
cka_ = np.zeros((num_models, num_models))
print(cka_.shape)
for i, (_, model_1) in enumerate(tqdm(embeddings_dict.items())):
    for j, (_, model_2) in enumerate(embeddings_dict.items()):
        cka_[i,j] = cka_class.compute(model_1, model_2)
0%| | 0/9 [00:00<?, ?it/s]
(9, 9)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [04:51<00:00, 32.36s/it]
cka_class.plot_heatmap(cka_, embeddings_dict.keys(), save_path=f'{exp_config.dataset_name}', save_fig=True)
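For intuition, linear CKA (the simplest variant of the similarity measure used above) can be written in a few lines. Note this is only a sketch: the notebook's CKA class uses an unbiased RBF-kernel estimator, so its values will differ.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (samples x features)."""
    X = X - X.mean(axis=0)  # column-center each representation
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
linear_cka(X, X)  # identical representations give CKA = 1.0
```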
4) Dimensionality Reduction#
The previous analysis showed how well each model groups the utterances of the same speaker in the two cases (scripted and spontaneous) in the high-dimensional embedding space. We will now replicate the same analysis in a lower-dimensional space to visualize the impact of speaking style on voice identity perception.
Accordingly, we will utilize different kinds of dimensionality reduction methods such as PCA, tSNE, UMAP and PaCMAP to get a better idea of how the speakers’ samples cluster together in 2D. However, one constraint is that these methods (except PCA) are sensitive to their hyperparameters, which could impact our interpretation of the results. Thus, a grid search across the hyperparameters of each method is implemented.
Another issue is quantifying the ability of these methods to preserve the distances among samples in the high dimension when presenting them in a lower dimension. To address this, we use two metrics, KNN and CPD, that represent the ability of the algorithm to preserve the local and global structure of the original embedding space, respectively. Both metrics are adopted from this paper, which defines them as follows:
KNN: The fraction of k-nearest neighbours in the original high-dimensional data that are preserved as k-nearest neighbours in the embedding. KNN quantifies preservation of the local, or microscopic, structure. The value of K used here is the minimum number of samples any speaker has in the original space.
CPD: Spearman correlation between pairwise distances in the high-dimensional space and in the embedding. CPD quantifies preservation of the global, or macroscopic, structure. It is computed across all pairs among 1000 points chosen randomly with replacement.
Consequently, we present the results of the dimensionality reduction methods in two ways: one optimizing the local structure metric (KNN) and the other optimizing the global structure metric (CPD).
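Both metrics are straightforward to sketch. The following is a hedged approximation of the definitions above; tie handling and subsampling details may differ from the original paper's code:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def knn_preservation(high, low, k=10):
    """Fraction of each point's k nearest neighbours preserved after reduction."""
    def knn_indices(data):
        d = squareform(pdist(data))
        np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
        return np.argsort(d, axis=1)[:, :k]
    hi, lo = knn_indices(high), knn_indices(low)
    return np.mean([len(np.intersect1d(hi[i], lo[i])) for i in range(len(high))]) / k

def cpd(high, low, n_points=1000, seed=0):
    """Spearman correlation of pairwise distances, over a random subsample."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(high), size=min(n_points, len(high)), replace=True)
    return spearmanr(pdist(high[idx]), pdist(low[idx])).correlation
```

A perfect reducer would score 1.0 on both metrics; comparing `high` against itself is a quick sanity check.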
4.1 Mapping Labels#
tuner = deciphering_enigma.ReducerTuner()
for i, model_name in enumerate(embeddings_dict.keys()):
    tuner.tune_reducer(embeddings_dict[model_name], metadata=metadata_df, dataset_name=exp_config.dataset_name, model_name=model_name)
Tuned Reduced Embeddings already saved for BYOL-A_default model!
Tuned Reduced Embeddings already saved for BYOL-S_default model!
Tuned Reduced Embeddings already saved for Hybrid_BYOL-S_default model!
Tuned Reduced Embeddings already saved for BYOL-S_cvt model!
Tuned Reduced Embeddings already saved for Hybrid_BYOL-S_cvt model!
Tuned Reduced Embeddings already saved for TRILLsson model!
Tuned Reduced Embeddings already saved for Wav2Vec2 model!
Tuned Reduced Embeddings already saved for HuBERT model!
Tuned Reduced Embeddings already saved for Data2Vec model!
import seaborn as sns
def visualize_embeddings(df, label_name, metrics=[], axis=[], acoustic_param={}, opt_structure='Local', plot_type='sns', red_name='PCA', row=1, col=1, hovertext='', label='spon'):
if plot_type == 'sns':
if label_name == 'Gender':
sns.scatterplot(data=df, x=(red_name, opt_structure, 'Dim1'), y=(red_name, opt_structure, 'Dim2'), hue=label_name, palette='deep', ax=axis)
else:
sns.scatterplot(data=df, x=(red_name, opt_structure, 'Dim1'), y=(red_name, opt_structure, 'Dim2'), hue=label_name
, style=label_name, palette='deep', ax=axis)
axis.set(xlabel=None, ylabel=None)
axis.get_legend().remove()
elif plot_type == 'plotly':
traces = px.scatter(x=df[red_name, opt_structure, 'Dim1'], y=df[red_name, opt_structure, 'Dim2'], color=df[label_name].astype(str), hover_name=hovertext)
traces.layout.update(showlegend=False)
axis.add_traces(
list(traces.select_traces()),
rows=row, cols=col
)
else:
points = axis.scatter(df[red_name, opt_structure, 'Dim1'], df[red_name, opt_structure, 'Dim2'],
c=df[label_name], s=20, cmap="Spectral")
return points
4.1.1. Mapping Gender#
import matplotlib.pyplot as plt
import pandas as pd
fig, ax = plt.subplots(9, 4, figsize=(40, 90))
optimize = 'Global'
reducer_names = ['PCA', 'tSNE', 'UMAP', 'PaCMAP']
for i, model_name in enumerate(embeddings_dict.keys()):
    df = pd.read_csv(f'../{exp_config.dataset_name}/{model_name}/dim_reduction.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': '', 'Unnamed: 20_level_1': '', 'Unnamed: 20_level_2': '',
                       'Unnamed: 21_level_1': '', 'Unnamed: 21_level_2': '',}, inplace=True)
    for j, name in enumerate(reducer_names):
        ax[0,j].set_title(f'{name}', fontsize=25)
        visualize_embeddings(df, 'Gender', metrics=[], axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='sns')
    ax[i, 0].set_ylabel(model_name, fontsize=25)
    ax[0,j].legend(bbox_to_anchor=(1, 1.15), fontsize=20)
plt.tight_layout()
4.1.2. Mapping Identity#
fig, ax = plt.subplots(9, 4, figsize=(40, 90))
optimize = 'Global'
reducer_names = ['PCA', 'tSNE', 'UMAP', 'PaCMAP']
for i, model_name in enumerate(embeddings_dict.keys()):
    df = pd.read_csv(f'../{exp_config.dataset_name}/{model_name}/dim_reduction.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': '', 'Unnamed: 20_level_1': '', 'Unnamed: 20_level_2': '',
                       'Unnamed: 21_level_1': '', 'Unnamed: 21_level_2': '',}, inplace=True)
    for j, name in enumerate(reducer_names):
        ax[0,j].set_title(f'{name}', fontsize=25)
        visualize_embeddings(df, 'ID', metrics=[], axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='sns')
    ax[i, 0].set_ylabel(model_name, fontsize=25)
plt.tight_layout()
4.1.3. Mapping Speaking Style (Script/Spon)#
fig, ax = plt.subplots(9, 4, figsize=(40, 90))
optimize = 'Global'
reducer_names = ['PCA', 'tSNE', 'UMAP', 'PaCMAP']
for i, model_name in enumerate(embeddings_dict.keys()):
    df = pd.read_csv(f'../{exp_config.dataset_name}/{model_name}/dim_reduction.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': '', 'Unnamed: 20_level_1': '', 'Unnamed: 20_level_2': '',
                       'Unnamed: 21_level_1': '', 'Unnamed: 21_level_2': '',}, inplace=True)
    for j, name in enumerate(reducer_names):
        ax[0,j].set_title(f'{name}', fontsize=25)
        visualize_embeddings(df, 'Label', metrics=[], axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='sns')
    ax[i, 0].set_ylabel(model_name, fontsize=25)
    ax[0,j].legend(bbox_to_anchor=(1, 1.15), fontsize=20)
plt.tight_layout()
4.2 Distance in Lower Dimensions#
labels = ['script', 'spon']
dfs = []
for label in labels:
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    pacmap_global_df = df.loc[:, ('PaCMAP', 'Global')]
    pacmap_global_df['wav_file'] = df['wav_file']; pacmap_global_df['label'] = label
    dfs.append(pacmap_global_df)
df = pd.concat(dfs, axis=0)
df.sample(10)
| | Dim1 | Dim2 | wav_file | label |
|---|---|---|---|---|
| 948 | -16.733345 | -1.996756 | 058_F_HT2_script_001.wav | script |
| 1696 | -7.589085 | -6.348401 | 065_F_DHR_script_010.wav | script |
| 2734 | -13.431674 | -4.555548 | 132_M_QNA_spon_049.wav | spon |
| 1765 | -7.870420 | -6.359378 | 065_F_LPP_script_016.wav | script |
| 1128 | 5.016654 | 5.860515 | 059_M_LPP_script_022.wav | script |
| 1121 | 4.929832 | 5.776471 | 059_M_LPP_script_015.wav | script |
| 428 | 16.376358 | 3.609564 | 052_M_NWS_script_002.wav | script |
| 2000 | 5.581617 | -12.904393 | 067_F_QNA_spon_083.wav | spon |
| 1281 | -1.567527 | 11.817277 | 061_M_QNA_spon_029.wav | spon |
| 2062 | -12.941590 | -5.509135 | 068_F_HT2_script_004.wav | script |
from scipy.spatial.distance import pdist, squareform
#create distance-based dataframe between all data samples in a square form
pairwise = pd.DataFrame(
    squareform(pdist(df.iloc[:, :2], metric='cosine')),
    columns = df['wav_file'],
    index = df['wav_file']
)
#move from square form DF to long form DF
long_form = pairwise.unstack()
#rename columns and turn into a dataframe
long_form.index.rename(['Sample_1', 'Sample_2'], inplace=True)
long_form = long_form.to_frame('Distance').reset_index()
#remove the distances computed between same samples (distance = 0)
long_form = long_form.loc[long_form['Sample_1'] != long_form['Sample_2']]
long_form.sample(10)
| | Sample_1 | Sample_2 | Distance |
|---|---|---|---|
| 29460525 | 069_F_QNA_spon_032.wav | 072_F_DHR_script_025.wav | 0.726305 |
| 16264862 | 133_M_DHR_script_015.wav | 052_M_ST2_spon_022.wav | 1.053295 |
| 18851113 | 052_M_QNA_spon_012.wav | 063_F_DHR_script_001.wav | 0.662966 |
| 26444879 | 064_F_QNA_spon_073.wav | 071_F_QNA_spon_095.wav | 0.012889 |
| 28205636 | 067_F_QNA_spon_024.wav | 058_F_QNA_spon_038.wav | 0.259614 |
| 4276696 | 056_F_HT2_script_048.wav | 067_F_DHR_script_019.wav | 0.183844 |
| 28975777 | 068_F_QNA_spon_048.wav | 053_M_DHR_script_029.wav | 0.395303 |
| 19771496 | 053_M_QNA_spon_055.wav | 049_F_QNA_spon_004.wav | 1.996565 |
| 11123129 | 066_M_NWS_script_008.wav | 049_F_QNA_spon_029.wav | 0.980920 |
| 23103239 | 059_M_QNA_spon_025.wav | 068_F_LPP_script_002.wav | 1.238760 |
#add columns for meta-data
long_form['Gender'] = long_form.apply(lambda row: row['Sample_1'].split('_')[1] if row['Sample_1'].split('_')[1] == row['Sample_2'].split('_')[1] else 'Different', axis=1)
long_form['Label'] = long_form.apply(lambda row: row['Sample_1'].split('_')[3] if row['Sample_1'].split('_')[3] == row['Sample_2'].split('_')[3] else 'Different', axis=1)
long_form['ID'] = long_form.apply(lambda row: row['Sample_1'].split('_')[0] if row['Sample_1'].split('_')[0] == row['Sample_2'].split('_')[0] else 'Different', axis=1)
long_form.sample(10)
| | Sample_1 | Sample_2 | Distance | Gender | Label | ID |
|---|---|---|---|---|---|---|
| 11829677 | 068_F_DHR_script_007.wav | 133_M_QNA_spon_060.wav | 0.050603 | Different | Different | Different |
| 2004548 | 052_M_DHR_script_023.wav | 058_F_QNA_spon_030.wav | 0.672418 | Different | Different | Different |
| 32927936 | 132_M_QNA_spon_068.wav | 056_F_QNA_spon_008.wav | 1.999721 | Different | spon | Different |
| 32933531 | 132_M_QNA_spon_069.wav | 052_M_ST2_spon_035.wav | 0.223480 | M | spon | Different |
| 11423077 | 067_F_HT2_script_013.wav | 053_M_DHR_script_017.wav | 1.999219 | Different | script | Different |
| 22509950 | 058_F_QNA_spon_056.wav | 068_F_DHR_script_004.wav | 0.411826 | F | Different | Different |
| 2862502 | 053_M_HT2_script_025.wav | 058_F_NWS_script_003.wav | 1.976306 | Different | script | Different |
| 29512461 | 069_F_QNA_spon_041.wav | 068_F_HT2_script_019.wav | 1.277754 | F | Different | Different |
| 7549556 | 061_M_HT2_script_016.wav | 052_M_LPP_script_000.wav | 0.004711 | M | script | Different |
| 6979211 | 060_F_HT2_script_017.wav | 049_F_DHR_script_011.wav | 0.610212 | F | script | Different |
#remove distances computed between different speakers and different labels
df = long_form.loc[(long_form['Gender']!='Different') & (long_form['Label']!='Different') & (long_form['ID']!='Different')]
df.sample(10)
| | Sample_1 | Sample_2 | Distance | Gender | Label | ID |
|---|---|---|---|---|---|---|
| 10517184 | 066_M_DHR_script_015.wav | 066_M_HT2_script_024.wav | 0.000549 | M | script | 066 |
| 9173513 | 064_F_DHR_script_012.wav | 064_F_NWS_script_007.wav | 0.106249 | F | script | 064 |
| 17997754 | 050_M_ST1_spon_004.wav | 050_M_QNA_spon_044.wav | 0.000895 | M | spon | 050 |
| 5630860 | 058_F_HT2_script_021.wav | 058_F_HT2_script_025.wav | 0.000425 | F | script | 058 |
| 25967078 | 063_F_QNA_spon_100.wav | 063_F_QNA_spon_090.wav | 0.002420 | F | spon | 063 |
| 31720064 | 072_F_QNA_spon_085.wav | 072_F_QNA_spon_048.wav | 0.036492 | F | spon | 072 |
| 2536298 | 053_M_DHR_script_000.wav | 053_M_LPP_script_025.wav | 0.001557 | M | script | 053 |
| 4409184 | 056_F_LPP_script_012.wav | 056_F_DHR_script_012.wav | 0.001254 | F | script | 056 |
| 19603322 | 053_M_QNA_spon_026.wav | 053_M_QNA_spon_058.wav | 0.350624 | M | spon | 053 |
| 16561024 | 133_M_HT2_script_027.wav | 133_M_LPP_script_014.wav | 0.000339 | M | script | 133 |
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
sns.violinplot(data=df, x='Label', y='Distance', inner='quartile', ax=ax)
ax.set_xlabel('Labels', fontsize=15)
ax.set_ylabel('Cosine Distances', fontsize=15)
# statistical annotation (Cohen's d effect size; the helper is defined here since it isn't imported above)
def cohend(x, y):
    nx, ny = len(x), len(y)
    pooled_std = np.sqrt(((nx-1)*x.std()**2 + (ny-1)*y.std()**2) / (nx+ny-2))
    return (x.mean() - y.mean()) / pooled_std
d=cohend(df['Distance'].loc[(df.Label=='spon')], df['Distance'].loc[(df.Label=='script')])
x1, x2 = 0, 1
y, h, col = df['Distance'].max() + 0.05, 0.01, 'k'
plt.plot([x1, x1, x2, x2], [y, y+h, y+h, y], lw=1.5, c=col)
plt.text((x1+x2)*.5, y+(h*1.5), f'cohen d={d:.2}', ha='center', va='bottom', color=col)
plt.tight_layout()
5) Identity Prediction from Scripted vs Spontaneous speech#
Here, we want to assess the ability of speech embeddings generated from scripted/spontaneous samples to predict speaker identity, and compare the two performances.
#split train and test samples for each participant
spon_df = df.loc[df.Label=='spon']
script_df = df.loc[df.Label=='script']
spon_train=[]; spon_test = []
script_train=[]; script_test = []
for speaker in df['Speaker_ID'].unique():
    speaker_spon_df = spon_df.loc[spon_df.Speaker_ID == speaker]
    speaker_script_df = script_df.loc[script_df.Speaker_ID == speaker]
    #draw a separate 70/30 split mask for each subset, since their lengths differ
    spon_msk = np.random.rand(len(speaker_spon_df)) < 0.7
    script_msk = np.random.rand(len(speaker_script_df)) < 0.7
    spon_train.append(speaker_spon_df[spon_msk])
    spon_test.append(speaker_spon_df[~spon_msk])
    script_train.append(speaker_script_df[script_msk])
    script_test.append(speaker_script_df[~script_msk])
train_spon_df = pd.concat(spon_train)
test_spon_df = pd.concat(spon_test)
train_script_df = pd.concat(script_train)
test_script_df = pd.concat(script_test)
train_spon_features = train_spon_df.iloc[:, 4:]
train_spon_labels = train_spon_df['Speaker_ID']
test_spon_features = test_spon_df.iloc[:, 4:]
test_spon_labels = test_spon_df['Speaker_ID']
train_script_features = train_script_df.iloc[:, 4:]
train_script_labels = train_script_df['Speaker_ID']
test_script_features = test_script_df.iloc[:, 4:]
test_script_labels = test_script_df['Speaker_ID']
5.1 Identity prediction from spontaneous samples#
clf_names, clfs, params_clf = get_sklearn_models()
grid_results = {}
for i, (clf_name, clf, clf_params) in enumerate(zip(clf_names, clfs, params_clf)):
    print(f'Step {i+1}/{len(clf_names)}: {clf_name}...')
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=_RANDOM_SEED)
    pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', clf)])
    grid_search = GridSearchCV(pipeline, param_grid=clf_params, n_jobs=-1, cv=cv, scoring='recall_macro', error_score=0)
    grid_result = grid_search.fit(train_spon_features, train_spon_labels)
    grid_results[clf_name] = grid_result
    test_result = grid_result.score(test_spon_features, test_spon_labels)
    print(f'Best {clf_name} UAR: {grid_result.best_score_*100: .2f} using {grid_result.best_params_}')
    print(f' Test Data UAR: {test_result*100: .2f}')
Step 1/3: LR...
Best LR UAR: 99.00 using {'estimator__C': 100.0, 'estimator__class_weight': None}
Test Data UAR: 99.06
Step 2/3: RF...
Best RF UAR: 92.84 using {'estimator__class_weight': 'balanced', 'estimator__max_depth': 25, 'estimator__min_samples_split': 2}
Test Data UAR: 91.20
Step 3/3: SVC...
Best SVC UAR: 98.43 using {'estimator__C': 100000.0, 'estimator__class_weight': 'balanced', 'estimator__kernel': 'linear'}
Test Data UAR: 98.42
5.2 Identity prediction from scripted samples#
clf_names, clfs, params_clf = get_sklearn_models()
grid_results = {}
for i, (clf_name, clf, clf_params) in enumerate(zip(clf_names, clfs, params_clf)):
    print(f'Step {i+1}/{len(clf_names)}: {clf_name}...')
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=_RANDOM_SEED)
    pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', clf)])
    grid_search = GridSearchCV(pipeline, param_grid=clf_params, n_jobs=-1, cv=cv, scoring='recall_macro', error_score=0)
    grid_result = grid_search.fit(train_script_features, train_script_labels)
    grid_results[clf_name] = grid_result
    test_result = grid_result.score(test_script_features, test_script_labels)
    print(f'Best {clf_name} UAR: {grid_result.best_score_*100: .2f} using {grid_result.best_params_}')
    print(f' Test Data UAR: {test_result*100: .2f}')
Step 1/3: LR...
Best LR UAR: 99.59 using {'estimator__C': 100.0, 'estimator__class_weight': 'balanced'}
Test Data UAR: 99.17
Step 2/3: RF...
Best RF UAR: 95.61 using {'estimator__class_weight': None, 'estimator__max_depth': 25, 'estimator__min_samples_split': 5}
Test Data UAR: 96.51
Step 3/3: SVC...
Best SVC UAR: 99.23 using {'estimator__C': 100000.0, 'estimator__class_weight': 'balanced', 'estimator__kernel': 'linear'}
Test Data UAR: 99.53
6) Gender Features in BYOL-S#
It is evident from the dimensionality reduction plots that the model separates gender well. Accordingly, we will identify the main BYOL-S features that encode gender and remove them, to see whether the BYOL-S representation still maintains gender separation or instead sheds light on a different kind of acoustic variation.#
Methodology:#
Train 3 classifiers (Logistic Regression ‘LR’, Random Forest ‘RF’ and Support Vector Classifier ‘SVC’) to predict gender from BYOL-S embeddings.
Select the top important features in gender prediction for each trained model.
Extract the common features across the 3 classifiers.
Remove these features from the extracted embeddings and apply dimensionality reduction to observe changes.
Model Training: The training process constitutes running 5-fold CV on standardized inputs and reporting the best Recall score.#
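As a hedged sketch of the feature-selection step for the linear models, features can be ranked by absolute coefficient magnitude; the notebook's eval_features_importance helper may rank differently (e.g. impurity-based importance for RF):

```python
import numpy as np

def top_k_features(coef, k=500):
    """Indices of the k features with the largest |weight| in a fitted linear model."""
    return np.argsort(np.abs(coef))[::-1][:k]

top_k_features(np.array([0.1, -2.0, 0.5, 1.5]), k=2)
# indices of the two largest absolute weights
```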
6.1 Train Classifiers#
#binarize the gender label (taken from the metadata extracted earlier)
gender = metadata_df['Gender']
gender_binary = pd.get_dummies(gender).values.argmax(1)
#define classifiers' objects and fit dataset
clf_names, clfs, params_clf = get_sklearn_models()
grid_results = {}
for i, (clf_name, clf, clf_params) in enumerate(zip(clf_names, clfs, params_clf)):
    print(f'Step {i+1}/{len(clf_names)}: {clf_name}...')
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=_RANDOM_SEED)
    pipeline = Pipeline([('transformer', StandardScaler()), ('estimator', clf)])
    grid_search = GridSearchCV(pipeline, param_grid=clf_params, n_jobs=-1, cv=cv, scoring='recall_macro', error_score=0)
    grid_result = grid_search.fit(byols_embeddings, gender_binary)
    grid_results[clf_name] = grid_result
    print(f'Best {clf_name} UAR: {grid_result.best_score_*100: .2f} using {grid_result.best_params_}')
Step 1/3: LR...
Best LR UAR: 99.98 using {'estimator__C': 100000.0, 'estimator__class_weight': 'balanced'}
Step 2/3: RF...
Best RF UAR: 99.13 using {'estimator__class_weight': None, 'estimator__max_depth': 20, 'estimator__min_samples_split': 10}
Step 3/3: SVC...
Best SVC UAR: 99.96 using {'estimator__C': 0.001, 'estimator__class_weight': None, 'estimator__kernel': 'linear'}
6.2 Select the important features for gender prediction#
from functools import reduce
#select top k features from all classifiers
features = []; k=500
for clf_name in clf_names:
    features_df = eval_features_importance(clf_name, grid_results[clf_name])
    features.append(features_df.index[:k])
#get common features among selected top features
indices = reduce(np.intersect1d, (features[0], features[1], features[2]))
#create one array containing only the common top features (gender features) and another containing the rest (genderless features)
gender_embeddings = byols_embeddings[:, indices]
genderless_embeddings = np.delete(byols_embeddings, indices, axis=1)
Extract important features from LR model:
Extract important features from RF model:
Extract important features from SVC model:
7) Acoustic Features Analysis in BYOL-S#
In this section, we compute some acoustic features (e.g. F0 and loudness) from the audio files and examine their distribution in the 2D dimensionality reduction plots.
import librosa
import pickle
import soundfile as sf
import pyloudnorm as pyln
f0s = []; loudness = []; mfcc_1 = []; rms=[]
for file in tqdm(wav_files):
    audio, orig_sr = sf.read(file)
    # #measure the median fundamental frequency
    # f0 = librosa.yin(audio, fmin=librosa.note_to_hz('C1'),
    #                  fmax=librosa.note_to_hz('C7'), sr=orig_sr)
    # f0s.append(np.nanmedian(f0))
    # #measure the loudness
    # meter = pyln.Meter(orig_sr) # create BS.1770 meter
    # l = meter.integrated_loudness(audio)
    # loudness.append(l)
    # #measure the first mfcc
    # mfccs = librosa.feature.mfcc(y=audio, sr=orig_sr)
    # mfcc_1.append(np.nanmedian(mfccs[0,:]))
    #measure rms
    rms.append(np.nanmedian(librosa.feature.rms(y=audio)))
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5816/5816 [00:11<00:00, 519.77it/s]
with open("rms.pickle", "wb") as output_file:
pickle.dump(rms, output_file)
with open("f0s.pickle", "rb") as output_file:
f0s = np.array(pickle.load(output_file))
with open("loudness.pickle", "rb") as output_file:
loudness = np.array(pickle.load(output_file))
with open("mfcc_1.pickle", "rb") as output_file:
mfcc_1 = np.array(pickle.load(output_file))
with open("rms.pickle", "rb") as output_file:
rms = np.array(pickle.load(output_file))
Plotting the Median F0 of audio samples across 4 dimensionality reduction methods
fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'f0', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Median F0', rotation=270)
plt.show()
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'f0', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
autosize=False,
width=1600,
height=1200, showlegend=False,)
fig.show()
Plotting the Loudness of audio samples across 4 dimensionality reduction methods
fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'loudness', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Loudness', rotation=270)
plt.show()
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'loudness', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
autosize=False,
width=1600,
height=1200, showlegend=False,)
fig.show()
Plotting the median of the first MFCC of audio samples across the 4 dimensionality reduction methods
fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]; df['mfcc_1'] = mfcc_1[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'mfcc_1', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Median MFCC 1', rotation=270)
plt.show()
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['f0'] = f0s[indices]; df['loudness'] = loudness[indices]; df['mfcc_1'] = mfcc_1[indices]; df['rms'] = rms[indices]
    df['f0'] = df['f0'].mask(df['f0'] > 300, 300)
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'mfcc_1', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
    autosize=False,
    width=1600,
    height=1200,
    showlegend=False,
)
fig.show()
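The mfcc_1 array used above holds one scalar per utterance: the median over time frames of the first MFCC. The MFCC matrix itself is computed upstream by the feature-extraction pipeline; the sketch below only illustrates the median aggregation step, using a random stand-in matrix in place of real MFCCs:

```python
import numpy as np

# Stand-in for an MFCC matrix of shape (n_mfcc, n_frames); in practice it
# would come from running an audio feature extractor on one utterance.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(13, 300))

# Collapse the first coefficient's time series to one scalar per utterance.
mfcc_1_utt = np.median(mfcc[0])
```

The median (rather than the mean) keeps the per-utterance summary robust to short bursts and silences at the frame level.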
Plotting the median RMS of audio samples across the 4 dimensionality reduction methods
fig, ax = plt.subplots(2, 4, figsize=(30, 15))
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    indices = list(np.where(labels == label)[0])
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    df['rms'] = rms[indices]
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        points = visualize_embeddings(df, 'rms', metrics=metric, axis=ax[i, j], opt_structure=optimize, red_name=name, plot_type='colorbar')
    ax[i, 0].set_ylabel(label, fontsize=15)
cbar = fig.colorbar(points, ax=ax.ravel().tolist())
cbar.ax.set_ylabel('Median RMS', rotation=270)
plt.show()
fig = make_subplots(rows=2, cols=4)
optimize = 'Global'
unique_labels = ['script', 'spon']
metrics = pd.read_csv('scriptvsspon_metrics.csv')
reducer_names, params_list = get_reducers_params()
for i, label in enumerate(unique_labels):
    df = pd.read_csv(f'{label}_dataset.csv', header=[0,1,2])
    indices = list(np.where(labels == label)[0])
    df['rms'] = rms[indices]
    df.rename(columns={'Unnamed: 17_level_1': '', 'Unnamed: 17_level_2': '', 'Unnamed: 18_level_1': '', 'Unnamed: 18_level_2': '', 'Unnamed: 19_level_1': '', 'Unnamed: 19_level_2': ''}, inplace=True)
    for j, name in enumerate(reducer_names):
        max_idx = metrics[optimize].loc[(metrics.Protocol==label)&(metrics.Method==name)].idxmax()
        metric = [metrics['Local'].iloc[max_idx], metrics['Global'].iloc[max_idx]]
        visualize_embeddings(df, 'rms', metrics=metric, axis=fig, opt_structure=optimize, red_name=name, plot_type='plotly', row=i+1, col=j+1, hovertext=df['wav_file'], label=label)
fig.update_layout(
    autosize=False,
    width=1600,
    height=1200,
    showlegend=False,
)
fig.show()
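Like mfcc_1, the rms array is precomputed upstream: a frame-wise RMS energy summarized by its median per utterance. A minimal numpy sketch of that computation, using a synthetic sine in place of a real recording (the frame and hop sizes are illustrative, not necessarily those used in the pipeline):

```python
import numpy as np

def frame_rms(y, frame_length=2048, hop_length=512):
    """Frame-wise RMS energy of a mono signal (no padding, for brevity)."""
    n_frames = 1 + max(0, (len(y) - frame_length) // hop_length)
    out = np.empty(n_frames)
    for t in range(n_frames):
        frame = y[t * hop_length : t * hop_length + frame_length]
        out[t] = np.sqrt(np.mean(frame ** 2))
    return out

# A 3-second, 16 kHz sine as a stand-in for one utterance; a sine with
# amplitude 0.5 has RMS 0.5 / sqrt(2) ~= 0.354.
sr = 16000
y = 0.5 * np.sin(2 * np.pi * 220 * np.arange(3 * sr) / sr)
median_rms = np.median(frame_rms(y))
```

Summarizing by the median again gives one scalar per utterance that is robust to pauses and transients, which is what the color scale in the plots above encodes.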